Abstract

Motivated by the evident success of context-tree based methods in lossless data
compression, we explore, in this paper, methods of the same spirit in universal prediction
of individual sequences. By context-tree prediction, we refer to a family of prediction
schemes, where at each time instant $t$, after having observed all outcomes of the data
sequence $x_1, \ldots, x_{t-1}$, but not yet $x_t$, the prediction is based on a "context" (or a state)
that consists of the $k$ most recent past outcomes $x_{t-k}, \ldots, x_{t-1}$, where the choice of $k$
may depend on the contents of a possibly longer, though limited, portion of the observed
past, $x_{t-k_{\max}}, \ldots, x_{t-1}$. This is different from the study reported in [1], where general
finite-state predictors, as well as "Markov" (finite-memory) predictors of fixed order,
were studied in the regime of individual sequences.

Another important difference between this study and [1] is the asymptotic regime. 
While in [1], the resources of the predictor (i.e., the number of states or the memory 
size) were kept fixed regardless of the length N of the data sequence, here we investigate 
situations where the number of contexts, or states, is allowed to grow concurrently with 
N. We are primarily interested in the following fundamental question: What is the 
critical growth rate of the number of contexts, below which the performance of the best 
context-tree predictor is still universally achievable, but above which it is not? We 
show that this critical growth rate is linear in N. In particular, we propose a universal 
context-tree algorithm that essentially achieves optimum performance as long as the 
growth rate is sublinear, and show that, on the other hand, this is impossible in the 
linear case. 

Index Terms: context-tree algorithm, universal prediction, finite-state machine, 
finite-memory machine, predictability, individual sequence. 

1 Introduction 

The problem of universal prediction of stochastic processes as well as individual sequences 
has received considerable attention throughout the years, in the literature pertaining to a 
large variety of disciplines, such as information theory, statistics, control theory, finance, 
and others (see [4] for a survey of some of the results on the theoretical aspects).

In [1], the problem of universal prediction of individual sequences relative to the class
of finite-state predictors was investigated. Given an infinitely long binary sequence
$x = (x_1, x_2, \ldots)$, the finite-state predictability, $\pi(x)$, was defined as

$$\pi(x) = \lim_{S\to\infty} \limsup_{N\to\infty} \pi_S(x_1, \ldots, x_N), \qquad (1)$$

where $\pi_S(x_1, \ldots, x_N)$ is the minimum relative frequency of prediction errors achieved among
all finite-state (FS) predictors with no more than $S$ states, when operating on the first $N$
bits, $x_1, \ldots, x_N$, of the infinite sequence $x$. An FS predictor with $S$ states, or an
$S$-state predictor for short, is in turn defined by a next-state function
$s_{t+1} = g(x_t, s_t) \in \mathcal{S}$, $|\mathcal{S}| \le S$, which recursively updates the state upon receiving a
new input, $x_t$, and by an output function $\hat{x}_{t+1} = f(s_t)$, which provides the prediction of
$x_{t+1}$. The main contribution of [1] was in proposing a universal (randomized) prediction
scheme that achieves $\pi(x)$ for every $x$. This scheme was based on the incremental parsing
procedure of the Lempel-Ziv algorithm [10]. Note that since $\pi(x)$ is defined by taking the
limit $S \to \infty$ after the limit supremum over $N \to \infty$, the regime of the asymptotics dictates
that $N$ is very large compared to $S$.

The present study differs from [1] in two main aspects. The first is that we confine
attention to context-tree prediction, which means that the current state, $s_t$, does not
necessarily evolve recursively according to a particular next-state function $g$, but may
rather correspond to a certain context, that is, a certain portion of the most recent past
$(x_{t-k}, x_{t-k+1}, \ldots, x_{t-1})$, where $k$ may vary dynamically according to a certain suffix tree,
which is subject to design. The motivation for exploring context-tree strategies stems
from their relative simplicity and their success in lossless data compression applications
(see, e.g., [3], [5], [6], [7], [8], [9] and references therein). Quite recently, a context-tree
approach was also analyzed in universal prediction of stochastic processes under certain
regularity conditions [2], [11], [12]. Also, as was shown in [1], the FS predictability is attainable
by finite-memory predictors (also referred to as "Markov predictors" therein), where $k$ is
fixed. A fortiori, it is attainable by the more general class of context-tree predictors, where
$k$ is allowed to vary.

The second aspect of the difference between this work and [1] is that here we no longer
confine ourselves to the regime where $N \gg S$. By allowing $S$ to grow with $N$ at a
certain rate, the performance analysis pertaining to the relative effectiveness of context-tree
predictors may become more refined and informative, in the sense that it has the potential to
reveal their advantage over ordinary finite-memory predictors, which, under the regime of [1],
are asymptotically as good as general FS predictors anyway, as mentioned above. Context-tree
predictors are intuitively superior to finite-memory predictors of fixed order because,
as in data compression, they allow the flexibility to allocate more memory resources (longer
contexts) to the "typical" patterns, which occur more often than others, and fewer resources
(shorter contexts) to the non-typical ones.

The question that we pose then is the following: What is the critical growth rate of
$S = S_N$ as a function of $N$, such that below this rate, the asymptotic optimum context-tree
prediction performance of every sequence is still universally achievable, but above this
rate, it is not? The answer turns out to be that this critical rate is linear in $N$. More
precisely, if $S_N = \alpha N$ ($\alpha$ being a positive constant), then no universal predictor (deterministic or
randomized) can attain the optimum context-tree prediction performance corresponding to
$\alpha N$ contexts, simultaneously for all sequences. Furthermore, for $\alpha = 1$, it is easy to show
that the value of this optimum prediction performance (in terms of the relative error rate)
is zero for any sequence. For a sublinear growth rate of $S_N$, on the other hand, we propose
a universal context-based prediction algorithm, whose number of contexts grows slightly
faster than $S_N$, and which asymptotically attains the context-tree predictability pertaining
to $S_N$ states, for every $(x_1, \ldots, x_N)$.

The outline of the paper is as follows. In Section 2, we give a formal definition of the 
problem and state the main result. Sections 3 and 4 are devoted to proofs. 

2 Problem Formulation and Main Result 

Let $x^N = (x_1, x_2, \ldots, x_N)$, $x_t \in \{0, 1\}$, $t = 1, \ldots, N$, designate a binary data sequence to
be sequentially predicted. A context-tree predictor with $S$ contexts (or, with $S$ leaves) is
defined as follows. The output function, $f(\cdot)$, of the predictor is given by

$$\hat{x}_{t+1} = f(s_t), \qquad (2)$$

where $\hat{x}_{t+1} \in \{0, 1\}$ is the predicted value for $x_{t+1}$ and $s_t$ is the current context (or, state),
which takes on values in a finite set $\mathcal{S}$, $|\mathcal{S}| \le S$, $S$ being a positive integer. We also allow
randomized output functions, namely, random selection of $\hat{x}_{t+1} \in \{0, 1\}$ with respect to
(w.r.t.) a conditional probability distribution given $s_t$. The context $s_t$ is determined from
the past, $(\ldots, x_{t-1}, x_t)$, by the choice of a context tree, which is a complete^1 binary tree
with $S$ leaves. At time $t$, after having observed $x_t$, the context $s_t$ is determined by reading
off the most recent data symbols in reversed order (first $x_t$, then $x_{t-1}$, etc.) and traversing
along the tree according to these symbols, starting at the root and ending at a leaf, unless
the depth of this leaf is larger than $t$ (which may happen at the beginning of the sequence),
in which case we stop at $x_1$. Denoting the resulting depth by $k = k(\ldots, x_{t-1}, x_t)$, the
context will then be given by $s_t = (x_{t-k+1}, \ldots, x_t)$.^2 Thus, the context tree is used as a
suffix tree. A context-tree predictor with $S$ contexts is then defined by a combination of
a context tree with a context set $\mathcal{S}$ and an output function $f : \mathcal{S} \to \{0, 1\}$ (or a set of
conditional distributions $\{P(\cdot|s),\ s \in \mathcal{S}\}$ in the randomized case). We denote by $\mathcal{P}_S$ the
class of all context-tree predictors with $S$ contexts.
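The context lookup described above can be sketched in a few lines of Python. This is only an illustration under assumed conventions that are not part of the paper: the tree is a nested dictionary in which a leaf is `None` and an internal node maps each symbol to a subtree, and the function name `find_context` is hypothetical.

```python
def find_context(tree, past):
    """Return the context selected by a suffix (context) tree.

    `tree` is a nested dict: a leaf is None, an internal node is
    {0: subtree, 1: subtree}. `past` holds (x_1, ..., x_t). The result is
    the tuple (x_{t-k+1}, ..., x_t), oldest symbol first.
    """
    node = tree
    k = 0
    # Read x_t, x_{t-1}, ... and descend until a leaf is reached
    # or the beginning of the data is hit (we then stop at x_1).
    while node is not None and k < len(past):
        node = node[past[-1 - k]]
        k += 1
    return tuple(past[len(past) - k:])


# A complete binary tree with S = 3 leaves: contexts "0", "01" and "11".
example_tree = {0: None, 1: {0: None, 1: None}}
print(find_context(example_tree, [0, 1, 1, 0, 1]))  # (0, 1)
```

Note that the depth $k$ depends on the data: the same tree yields a one-symbol context when $x_t = 0$ and a two-symbol context when $x_t = 1$.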

Let us now expand the class of predictors $\mathcal{P}_S$ according to the following model: Given
a total budget of $S$ states, we have the freedom to split it into two subsets of states. One
subset of states, of size $S^c \in \{1, 2, \ldots, S\}$, is dedicated to a context tree of $S^c$ leaves,
as before (with $S$ being replaced by $S^c$). The states in this subset will be referred to as
context-tree states. The other subset of states, of size $S^t \le S - S^c$, is dedicated to a
finite-state machine induced by a prefix tree, which is a complete binary tree with a total
of $S^t$ nodes (including the root and the internal nodes, but not the leaves). The states in
this subset will be referred to as transient states, and each one of the $S^t$ transient states
corresponds to the root or to an internal node in the prefix tree. The system then works
as follows: It begins at the subset of transient states, and the initial state, $s_1$, is always the
root of the prefix tree. As long as $s_t$ is an internal node (or the root) of this tree, the next
state $s_{t+1} = g(x_t, s_t)$ is the child of $s_t$ corresponding to the binary value of $x_t$, provided that
this child is an internal node as well. Otherwise (i.e., if this child is a leaf), the system
passes to the subset of context-tree states, and then $s_{t+1}$ will be the context pertaining to
time $t + 1$. From this point onward, the system remains in the subset of context states, and
operates as described in the previous paragraph. Thus, the transient states are used only
at the beginning of the sequence, but at a certain time $t$ (that may depend on the contents of
$(x_1, \ldots, x_t)$), there is a transition into the context state set. We refer to these two modes
of operation of the system as the transient mode and the context-tree mode, respectively.

^1 By a complete binary tree, we refer to a binary tree where every node that is not a leaf has two children.
^2 Note that $k$ cannot exceed $S - 1$, and so, the context is actually determined by no more than the $S - 1$
most recent symbols.

Let us define $\mathcal{P}_S^*$ as the union, over all pairs of positive integers $\{(S^t, S^c) : S^t + S^c \le S\}$,
of all sets of combinations of a prefix tree with $S^t$ states and a suffix (context) tree with
$S^c$ leaves. The $S$-th order context predictability of $x^N$, denoted $\kappa(x^N, S)$, is defined as the
minimum fraction of errors^3 achieved among all predictors in $\mathcal{P}_S^*$.

This structure, of a transient mode followed by the context-tree mode, can be motivated
by the following consideration: Note that in the transient mode, which is active at the
beginning of the sequence, the predictor is actually using the entire past, $(x_1, \ldots, x_t)$, as
its context. This usage of the entire past can be attributed, in a real-life situation, to
"training," or "learning." During this training time, in addition to providing predictions,
the system "learns," from the whole data available thus far, what the "typical" patterns are,
and then, on the basis of this study, it designs the context-tree predictor to be used in the
context-tree mode, which will remain fixed thereafter. Since the total memory resources
(given by $S$) are limited, they have to be divided between the training and the size of the
context dictionary to be used in the context-tree mode. Thus, there is a tradeoff, but the
definition of the class $\mathcal{P}_S^*$ allows full freedom with regard to the partition between $S^t$
transient states and $S^c$ context-tree states. At one extreme, we can take $S^t = 0$ and
$S^c = S$, which is a pure context-tree predictor in $\mathcal{P}_S$, with no transient mode at all. At
the other extreme, we have $S^t = S - 1$ and $S^c = 1$, where the resources are all devoted to the
transient mode, and the context tree has a root only, which means that the prediction $\hat{x}_{t+1}$
is constant, independently of past data.

Having defined $\mathcal{P}_S^*$, let us now allow $S$ to grow with $N$, and accordingly, redefine the
notation of the total number of states by $S_N$. For a monotonically non-decreasing sequence
$\{S_N\}_{N \ge 1}$ of positive integers, we say that the context predictability is universally achievable
w.r.t. $\{S_N\}_{N \ge 1}$ if there exists a randomized predictor (not necessarily a context predictor),
$\hat{x}_t = f_t(x_1, \ldots, x_{t-1})$, $t = 1, 2, \ldots$, such that for every infinite sequence $x = (x_1, x_2, \ldots)$

$$\limsup_{N\to\infty} \left[ \frac{1}{N} \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} - \kappa(x^N, S_N) \right] \le 0, \qquad (3)$$

where the probabilities, $\Pr\{\hat{x}_t \ne x_t\}$, are w.r.t. the randomization. We say that a predictor
achieves the context predictability w.r.t. $\{S_N\}_{N \ge 1}$ uniformly rapidly if the convergence in
eq. (3) is uniform, i.e.,

$$\limsup_{N\to\infty} \max_{x^N \in \{0,1\}^N} \left[ \frac{1}{N} \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} - \kappa(x^N, S_N) \right] \le 0. \qquad (4)$$

^3 When randomized output functions are allowed, this should be redefined as the minimum expected
fraction of errors, where the expectation is w.r.t. the randomization. However, it is easy to see that the best
output function is always deterministic.


The questions we address are the following:

1. What is the fastest growth rate of $\{S_N\}$ such that the context predictability is still
universally achievable w.r.t. $\{S_N\}_{N \ge 1}$ uniformly rapidly?

2. Whenever the context predictability is universally achievable, can we propose a (simple)
universal predictor?

Theorem 1 answers both questions and tells us that this critical growth rate is linear.

Theorem 1 The context predictability w.r.t. $\{S_N\}_{N \ge 1}$ is universally achievable uniformly
rapidly if and only if $\lim_{N\to\infty} S_N/N = 0$.

Discussion: The proof of Theorem 1 consists of the sufficiency part, where a particular
universal (horizon-dependent) predictor is proposed (Section 3), and the necessity part
(Section 4). As we shall see, the universal predictor proposed in Section 3 bases its
predictions on no more than $2N/M_N + 1$ contexts, where $\{M_N\}_{N \ge 1}$ is a sequence of positive
integers tending to infinity such that $\lim_{N\to\infty} S_N M_N/N = 0$, and so, the number of
contexts used by the algorithm must increase slightly faster than $\{S_N\}$. As will be seen
in Section 3, the best choice of $M_N$, in the sense of minimizing (the upper bound on)
$\max_{x^N \in \{0,1\}^N} [(1/N) \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} - \kappa(x^N, S_N)]$, is of the order of $(N/S_N)^{2/3}$, which
yields a redundancy of the order of $(S_N/N)^{1/3}$. It should be noted that it is also possible
to obtain a redundancy rate of $O((S_N \log S_N)/N)$, which may be better in some cases, by
using the expert-advice methodology (cf. the relevant references in [4]), where the "experts"
are all the members of $\mathcal{P}_{S_N}^*$. However, the implementation of the expert-advice algorithm is
extremely complex because it needs to apply all predictors of $\mathcal{P}_{S_N}^*$ in parallel. The proposed
horizon-dependent algorithm is next modified to be horizon-independent.

As for the necessity part of Theorem 1, we assume that $S_N = \alpha N + 1$ for some positive
constant $\alpha \le 1$, and demonstrate that there is a set of sequences $\{x^N\}$ for which, on the
one hand, $\kappa(x^N, \alpha N + 1) = 0$, but on the other hand, for every universal predictor (which
may be deterministic or randomized, and with unlimited resources), at least one of these
sequences would yield no less than $\alpha N/2$ errors. Stated in mathematical language, we
have:

$$\max_{x^N \in \{0,1\}^N} \left[ \frac{1}{N} \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} - \kappa(x^N, \alpha N + 1) \right] \ge \frac{\alpha}{2} \qquad (5)$$

for all $N$, and so, when $\lim_{N\to\infty} S_N/N = \alpha > 0$, the context predictability is not universally
achievable uniformly rapidly. The question of universal achievability which is not uniformly
rapid, in the linear case, remains open.

3 A Universal Prediction Scheme — Proof of Sufficiency 

For a given $N$, choose a positive integer $M_N$, and consider the following recursive definition
of the prediction context, which also defines the proposed algorithm.

Let $k_0 = k_0(x_1, \ldots, x_t)$ denote the largest positive integer $k$ such that the following two
conditions hold at the same time:

1. The string $(x_{t-k+1}, \ldots, x_t)$ appears (possibly, with overlaps) at least $M_N$ times along
$(x_1, \ldots, x_t)$.

2. The string $(x_{t-k+2}, \ldots, x_t)$ has already been used as the prediction context at least
$M_N$ times in the past.

If no such $k$ exists, define $k_0 = 0$. The string $(x_{t-k_0+1}, \ldots, x_t)$ is referred to as the prediction
context used at time $t$, and in the case $k_0 = 0$, the context $s_t$ is defined as "null," i.e., "no
context."

Next, consider the prediction scheme of [1], defined w.r.t. the prediction context
$s_t = (x_{t-k_0+1}, \ldots, x_t)$. In particular, at each time instant $t$, determine the context using the
above described rule, and randomly draw the prediction $\hat{x}_{t+1}$ according to the conditional
distribution $\Pr\{\hat{x}_{t+1} = 1 | s_t\} = \phi(\hat{p}_t(1|s_t), N_t(s_t))$, where $\phi$ is defined as follows:

$$\phi(\alpha, n) = \begin{cases} 0 & \alpha < \frac{1}{2} - \epsilon_n \\ \frac{1}{2\epsilon_n}\left(\alpha - \frac{1}{2}\right) + \frac{1}{2} & \frac{1}{2} - \epsilon_n \le \alpha \le \frac{1}{2} + \epsilon_n \\ 1 & \alpha > \frac{1}{2} + \epsilon_n \end{cases} \qquad (6)$$

with $\epsilon_n = 1/(2\sqrt{n+2})$, and where $\hat{p}_t(1|s) = [N_t(s, 1) + 1/2]/[N_t(s) + 1]$, $N_t(s)$ being the
number of occurrences of the context $s$ (w.r.t. the above rule) along $(x_1, \ldots, x_{t-1})$ and
$N_t(s, 1)$ being the number of times these appearances of context $s$ were followed by "1".
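To make the rule concrete, the following Python sketch puts the context-selection conditions and the randomized output function of eq. (6) together. It is a minimal illustration rather than an efficient implementation: substring occurrences are counted naively, and the names `phi`, `count_occurrences`, `select_context` and `predict_sequence`, as well as the fixed random seed, are assumptions introduced here. The argument `m` plays the role of $M_N$.

```python
import math
import random


def phi(a, n):
    """Randomization function of eq. (6) with eps_n = 1 / (2 sqrt(n + 2))."""
    eps = 1.0 / (2.0 * math.sqrt(n + 2))
    if a < 0.5 - eps:
        return 0.0
    if a > 0.5 + eps:
        return 1.0
    return (a - 0.5) / (2.0 * eps) + 0.5


def count_occurrences(seq, pat):
    """Number of (possibly overlapping) occurrences of pat in seq."""
    k = len(pat)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == pat)


def select_context(past, used, m):
    """Longest suffix of `past` satisfying conditions 1 and 2, else ()."""
    best = ()
    for k in range(1, len(past) + 1):
        cand = past[-k:]
        # Condition 1: a longer suffix can never be more frequent,
        # so the search can stop at the first failure.
        if count_occurrences(past, cand) < m:
            break
        # Condition 2: the one-symbol-shorter suffix was used >= m times.
        if used.get(cand[1:], 0) >= m:
            best = cand
    return best


def predict_sequence(x, m, rng=None):
    """Sequentially predict the bits of x; returns the list of predictions."""
    rng = rng or random.Random(0)
    used = {}  # context -> number of times it served as prediction context
    ones = {}  # context -> number of those uses that were followed by a 1
    preds = []
    for t in range(len(x)):
        s = select_context(tuple(x[:t]), used, m)
        n = used.get(s, 0)
        p1 = (ones.get(s, 0) + 0.5) / (n + 1)
        preds.append(1 if rng.random() < phi(p1, n) else 0)
        used[s] = n + 1
        ones[s] = ones.get(s, 0) + x[t]
    return preds
```

With a threshold `m` larger than the sequence length, the context never grows beyond the null context and the scheme reduces to the zero-order predictor of [1]; smaller thresholds let frequently occurring suffixes be promoted to longer contexts.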

We next analyze the performance of this prediction scheme in comparison to the best
reference predictor in $\mathcal{P}_{S_N}^*$, with a set $\mathcal{S}^t$ of $S_N^t$ transient states and a set $\mathcal{S}^c$ of
$S_N^c$ context states, $S_N^t + S_N^c \le S_N$. An upper bound on the redundancy,
$[(1/N) \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} - \kappa(x^N, S_N)]$, will be obtained by bounding
$(1/N) \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\}$ from above, and bounding $\kappa(x^N, S_N)$ from below. We begin with
the latter by counting only errors that occur during the context-tree mode of the reference
predictor, which lasts at least $N - S_N^t$ time units, as the transient mode cannot last longer
than $S_N^t$ instants. For the given $x^N$, let $(s_1, \ldots, s_N)$ be the sequence of states that would
have been obtained had only the context-tree machine of the reference predictor been used,
from $t = 1$ to $t = N$. As is shown in [1], the number of errors made by such a (pure
context-tree) predictor is given by $\sum_{s \in \mathcal{S}^c} \min\{N(s, 0), N(s, 1)\}$, where $N(s, x)$, $s \in \mathcal{S}^c$,
$x \in \{0, 1\}$, is the number of joint occurrences of $s_t = s$ and $x_{t+1} = x$ along the pair of
sequences $(s^N, x^N)$. The joint count of $s_t = s$ and $x_{t+1} = x$, during the context-tree mode
only, cannot then be smaller than $N(s, x) - S_N^t$, and so,

$$\kappa(x^N, S_N) \ge \frac{1}{N} \left[ \sum_{s \in \mathcal{S}^c} \min\{N(s, 0), N(s, 1)\} - S_N^t \right] \ge \frac{1}{N} \left[ \sum_{s \in \mathcal{S}^c} \min\{N(s, 0), N(s, 1)\} - S_N \right]. \qquad (7)$$
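The error count $\sum_{s} \min\{N(s, 0), N(s, 1)\}$ of the best output function for a fixed context tree is easy to evaluate numerically. The sketch below assumes the same hypothetical nested-dictionary tree representation as before (a leaf is `None`); the names `context_of` and `min_errors` are illustrative only, and contexts truncated at the beginning of the sequence are simply counted as separate states.

```python
from collections import Counter


def context_of(tree, past):
    """Context selected by a suffix tree (leaf = None, node = {0: .., 1: ..})."""
    node, k = tree, 0
    while node is not None and k < len(past):
        node = node[past[-1 - k]]
        k += 1
    return tuple(past[len(past) - k:])


def min_errors(tree, x):
    """Sum over contexts s of min{N(s,0), N(s,1)} for the sequence x."""
    counts = Counter()
    for t in range(1, len(x)):
        # Joint occurrence of the context after t symbols and the next symbol.
        counts[(context_of(tree, x[:t]), x[t])] += 1
    contexts = {s for (s, _) in counts}
    return sum(min(counts[(s, 0)], counts[(s, 1)]) for s in contexts)


alternating = [0, 1] * 5
print(min_errors({0: None, 1: None}, alternating))  # 0: one symbol of memory suffices
print(min_errors(None, alternating))                # 4: the null context cannot do better
```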



As was also shown in [1], when the predictor (6) is applied, the contribution of each state
$s$ to the expected number of prediction errors, $E N_e(s) = \sum_{t:\, s_t = s} \Pr\{\hat{x}_{t+1} \ne x_{t+1}\}$, is upper
bounded by

$$E N_e(s) \le \min\{N(s, 0), N(s, 1)\} + \sqrt{N(s) + 1} + \frac{1}{2}, \qquad (8)$$

where $N(s) = N(s, 0) + N(s, 1)$ is the number of occurrences of $s$.

Consider the above described universal prediction scheme applied to $x^N$, and let us
now denote the sequence of contexts generated by this algorithm as $\tilde{s}^N = (\tilde{s}_1, \ldots, \tilde{s}_N)$ (to
distinguish from the contexts of the context-tree component of the reference predictor of
$\mathcal{P}_{S_N}^*$), and let $\tilde{\mathcal{S}}_N$ denote the set of contexts generated this way.

We first observe that there are at most $2M_N S_N^c$ time instants where $\tilde{s}_t$ is a proper suffix of
$s_t \in \mathcal{S}^c$. This follows from the following consideration. In a full binary tree with $S_N^c$ leaves,
like the tree corresponding to the reference predictor, there are always $S_N^c - 1$ internal nodes
(including the root), pertaining to all possible states which are suffixes of some state in $\mathcal{S}^c$.
Now, by construction of the algorithm, every such internal node $s'$ is used as a prediction
context no more than $2M_N$ times. This is because upon the $(2M_N + 1)$-st time, either
the pattern $(0, s')$ or $(1, s')$ has appeared at least $M_N$ times, and thus both conditions for
extending the prediction context by one bit are satisfied. Thus, the total number of times
that suffixes of contexts in $\mathcal{S}^c$ are used as prediction contexts cannot exceed $2M_N(S_N^c - 1)$.
We will further upper bound this number by $2M_N S_N$, for simplicity.

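In the remaining time instants, of course, either $\tilde{s}_t = s_t$ or $s_t$ becomes a suffix of $\tilde{s}_t$.
Correspondingly, for a given $s \in \mathcal{S}^c$, let $T_s$ denote the sub-tree of prediction contexts,
rooted at $s$, that are generated by the algorithm, i.e., all generated contexts $\{\tilde{s}\}$ suffixed
by $s$ (including $s$ itself as the root). Following eq. (8), the expected number of errors is
bounded by

$$\sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} \le 2M_N S_N + \sum_{s \in \mathcal{S}^c} \sum_{\tilde{s} \in T_s} \left[ \min\{N(\tilde{s}, 0), N(\tilde{s}, 1)\} + \sqrt{N(\tilde{s}) + 1} + \frac{1}{2} \right], \qquad (9)$$

where the first term, $2M_N S_N$, accounts for the worst case of totally erroneous prediction at all
$2M_N S_N$ visits at states $\{\tilde{s}\}$ that are suffixes of some states in $\mathcal{S}^c$, and the second term is an
upper bound on the expected number of errors at all other times. Now, let us decompose
the second term into

$$A = \sum_{s \in \mathcal{S}^c} \sum_{\tilde{s} \in T_s} \min\{N(\tilde{s}, 0), N(\tilde{s}, 1)\} \qquad (10)$$

and

$$B = \sum_{s \in \mathcal{S}^c} \sum_{\tilde{s} \in T_s} \left[ \sqrt{N(\tilde{s}) + 1} + \frac{1}{2} \right]. \qquad (11)$$

We shall now bound each one of them separately. As for $A$, we have

$$A \le \sum_{s \in \mathcal{S}^c} \min\left\{ \sum_{\tilde{s} \in T_s} N(\tilde{s}, 0), \sum_{\tilde{s} \in T_s} N(\tilde{s}, 1) \right\} \le \sum_{s \in \mathcal{S}^c} \min\{N(s, 0), N(s, 1)\} \le N \cdot \kappa(x^N, S_N) + S_N, \qquad (12)$$

where the last step follows from eq. (7).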



Regarding $B$, we have the following consideration: As mentioned earlier, for internal nodes
in $T_s$ (and a fortiori for the leaves), we know that $N(\tilde{s})$ cannot exceed $2M_N$, and so,

$$B \le \sum_{s \in \mathcal{S}^c} \sum_{\tilde{s} \in T_s} \left( \sqrt{2M_N + 1} + \frac{1}{2} \right) = \left( \sqrt{2M_N + 1} + \frac{1}{2} \right) \cdot \sum_{s \in \mathcal{S}^c} |T_s|. \qquad (13)$$

Now, $\sum_{s \in \mathcal{S}^c} |T_s|$ is, of course, upper bounded by the total number of contexts generated
by the proposed universal predictor. As every internal node of the context tree generated
appears at least $M_N$ times (by the second condition that defines the algorithm), the total
number of internal nodes of $\tilde{\mathcal{S}}_N$ cannot exceed $N/M_N$, and so, the total number of nodes
(including the leaves) cannot exceed $2N/M_N + 1$. Thus, $\sum_{s \in \mathcal{S}^c} |T_s| \le 2N/M_N + 1$, and we
can further upper bound $B$ by

$$B \le \left( \sqrt{2M_N + 1} + \frac{1}{2} \right) \cdot \left( \frac{2N}{M_N} + 1 \right), \qquad (14)$$

which upon normalizing by $N$ becomes

$$\frac{B}{N} \le \left( \sqrt{2M_N + 1} + \frac{1}{2} \right) \cdot \left( \frac{2}{M_N} + \frac{1}{N} \right). \qquad (15)$$

The total expected excess frequency of errors (redundancy) is thus

$$\frac{1}{N} \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} - \kappa(x^N, S_N) \le \left( \sqrt{2M_N + 1} + \frac{1}{2} \right) \cdot \left( \frac{2}{M_N} + \frac{1}{N} \right) + \frac{(2M_N + 1) S_N}{N}, \qquad (16)$$

where the additional term comes from the first term of the r.h.s. of eq. (9) and the
right-most side of eq. (12). The conditions for vanishing redundancy are then $M_N \to \infty$ and
$M_N S_N/N \to 0$. Both conditions can be satisfied at the same time as long as $S_N$ is sublinear
in $N$. As the r.h.s. is independent of $x^N$, the convergence to zero is uniformly fast. This
completes the proof of the sufficiency part. □
Two comments are in order at this point: 

1. Note that the asymptotically optimum growth rate of $M_N$ (in the sense of minimizing
the r.h.s.) is $M_N = O((N/S_N)^{2/3})$, which yields $B/N \le O((S_N/N)^{1/3})$.

2. The above algorithm is horizon-dependent, i.e., the length of the sequence, $N$, has
to be known ahead of time in order to determine the value of $M_N$. It is not difficult,
however, to modify this algorithm so as to be horizon-independent. One way to do
that is the following: Instead of defining the required number of context repetitions,
in conditions 1 and 2 of the algorithm, to depend directly on $N$, let us define it as
depending on $k$, the length of the examined context. More specifically, let us replace
$M_N$ by $M(k)$ and by $M(k - 1)$ in conditions 1 and 2, respectively, where $\{M(k)\}_{k \ge 1}$ is
a certain monotonic sequence of positive integers that tends to infinity. The reader is
referred to the appendix for more details on the redundancy analysis and the
considerations regarding the choice of the sequence $\{M(k)\}$. It is also demonstrated, in the
appendix, that the (upper bound on the) redundancy term of this algorithm decays
faster than that of the LZ-based algorithm proposed in [1].
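The growth rate quoted in the first comment can be checked with a short calculation. As a sketch, assume that the redundancy bound is dominated by one term proportional to $M_N S_N/N$ and another proportional to $1/\sqrt{M_N}$; the constants $c_1, c_2 > 0$ and the function $g$ below are introduced only for this illustration, and $M_N$ is treated as a continuous variable.

```latex
% Dominant terms of the redundancy bound as a function of M = M_N:
g(M) = c_1 \frac{M S_N}{N} + \frac{c_2}{\sqrt{M}}

% Setting the derivative to zero:
g'(M) = c_1 \frac{S_N}{N} - \frac{c_2}{2} M^{-3/2} = 0
\quad\Longrightarrow\quad
M^{*} = \left( \frac{c_2}{2 c_1} \cdot \frac{N}{S_N} \right)^{2/3}

% Substituting back, both terms scale as (S_N/N)^{1/3}:
g(M^{*}) = 3 \cdot 2^{-2/3}\, c_1^{1/3} c_2^{2/3} \left( \frac{S_N}{N} \right)^{1/3}
```

Under these assumptions the two terms are balanced at $M^{*} \propto (N/S_N)^{2/3}$, and the resulting bound vanishes exactly when $S_N/N \to 0$.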

4 Proof of Necessity 

Let $\alpha \in (0, 1]$ be given, and let $S_N = \alpha N + 1$, assuming without essential loss of generality
that $\alpha N$ is an integer. Consider the recursive generation of a sequence $x^N$ by $x_t = f(s_t)$,
$t = 1, 2, \ldots, N$, where $s_t$ is the state associated with previously generated symbols, and $f$
is the output function, corresponding to a certain member of $\mathcal{P}_{\alpha N+1}^*$. Clearly, when this
predictor is applied to the very same sequence that it has generated, then there are no
prediction errors, and so, $\kappa(x^N, \alpha N + 1) = 0$ for every such sequence.

Next, consider a subset of $2^{\alpha N}$ pure transient-state predictors from $\mathcal{P}_{\alpha N+1}^*$, i.e.,
predictors with $S_N^t = \alpha N$ and $S_N^c = 1$, whose associated $x$-sequences (generated as above)
start with all $2^{\alpha N}$ possible binary strings of length $\alpha N$, correspondingly. That is, the first
predictor generates a sequence that begins with $\alpha N$ zeroes, the second predictor
generates a sequence whose first $\alpha N$ bits are $(0, 0, \ldots, 0, 1)$, and so on. Clearly, there are
enough degrees of freedom to do that: Given any desired binary string $(x_1, \ldots, x_{\alpha N})$ of
the first $\alpha N$ bits of $x^N$, consider the finite-state (transient) machine corresponding to a
prefix tree whose internal nodes are $\emptyset$ (the null string), $\{x_1\}, \{x_1, x_2\}, \ldots, \{x_1, \ldots, x_{\alpha N}\}$,
and whose leaves are $\{\bar{x}_1\}, \{x_1, \bar{x}_2\}, \{x_1, x_2, \bar{x}_3\}, \ldots, \{x_1, x_2, \ldots, x_{\alpha N-1}, \bar{x}_{\alpha N}\}$, $\bar{x}_i$ being the
complement of $x_i$, $i = 1, \ldots, \alpha N$. Now, apply to each of the internal nodes an
output function that will give the next desired outcome, i.e., $f(\emptyset) = x_1$, $f(\{x_1\}) = x_2$,
$f(\{x_1, x_2\}) = x_3$, $\ldots$, $f(\{x_1, x_2, \ldots, x_{\alpha N-1}\}) = x_{\alpha N}$. This construction guarantees that
each one of the $2^{\alpha N}$ context-tree predictors will generate a different sequence because all
these sequences differ from each other even in their first $\alpha N$ bits.

Finally, define a random vector $X^N$, which is distributed uniformly across all these
$2^{\alpha N}$ $N$-vectors. Now, for any randomized predictor, with no matter how many states, the
expected fraction of errors (where the expectation is both w.r.t. the ensemble of $X^N$ and
w.r.t. possible randomization) is lower bounded as follows:

$$\frac{1}{N} \sum_{t=1}^{N} \Pr\{\hat{X}_t \ne X_t\} \ge \frac{1}{N} \sum_{t=1}^{\alpha N} \Pr\{\hat{X}_t \ne X_t\} = \frac{\alpha}{2},$$

where the last equality is due to the fact that $(X_1, \ldots, X_{\alpha N})$ is, in fact, governed by the
memoryless binary symmetric source (independent, fair coin tosses) since the distribution is
uniform over all $2^{\alpha N}$ strings of length $\alpha N$. Clearly, every predictor makes exactly 50% errors
on the binary symmetric source. It therefore follows that for any randomized predictor, there
exists at least one vector $x^N$, out of the above defined ensemble of $2^{\alpha N}$ vectors, for which
the expected fraction of errors is not below $\alpha/2$. This completes the proof of the necessity
part.

Note that for the case $\alpha = 1$, we have $\kappa(x^N, N + 1) = 0$ for every sequence, but any
predictor would perform at least as badly as random guessing (50% errors) on some sequence.
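Both halves of this argument can be checked on small examples. The Python sketch below is an illustration under assumed names (`transient_machine`, `generate`, `expected_errors` and `majority` are hypothetical): the first two functions build the output function on the prefix-tree nodes for a desired string and regenerate that string without error, and the third averages the number of errors of a deterministic predictor over all binary strings of a given length, which comes out as exactly half the length.

```python
from itertools import product


def transient_machine(prefix):
    """Output function on prefix-tree nodes: node (x_1..x_i) -> x_{i+1}."""
    return {tuple(prefix[:i]): prefix[i] for i in range(len(prefix))}


def generate(f, n):
    """Run the machine on its own output for n steps, starting at the root."""
    state, out = (), []
    for _ in range(n):
        bit = f[state]
        out.append(bit)
        state = state + (bit,)
    return out


def expected_errors(predictor, n):
    """Average error count of a deterministic predictor over all 2^n strings."""
    total = 0
    for x in product((0, 1), repeat=n):
        total += sum(predictor(x[:t]) != x[t] for t in range(n))
    return total / 2 ** n


def majority(past):
    """Example predictor: guess 1 if ones are a strict majority so far."""
    return int(2 * sum(past) > len(past))


print(generate(transient_machine((1, 0, 1, 1)), 4))  # [1, 0, 1, 1]
print(expected_errors(majority, 8))                   # 4.0
```

The value 4.0 does not depend on the choice of `majority`: for every past, exactly half of the equally likely continuations disagree with any deterministic guess.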

Appendix 

In this appendix, we show how the performance analysis of Section 3 should be modified
if the horizon-dependent algorithm is replaced by the horizon-independent algorithm
described in the second comment at the end of Section 3.

In analogy to eq. (9), we have two main redundancy terms: The first term is the
summation of $2M(d_s)$ over all internal nodes $\{s\}$ of the context tree (replacing the
term $2M_N S_N$), where $d_s$ stands for the depth of state $s$ in the context tree, i.e., the
distance of $s$ from the root. This term is further bounded by $2S_N \max_{s \in \mathcal{S}^c} M(d_s) =
2S_N M(\max_{s \in \mathcal{S}^c} d_s) \le 2S_N M(S_N)$, where we have used the fact that the deepest leaf in a
complete tree with $S_N$ leaves cannot be more than $S_N$ branches away from the root. The
second term is $B$, which is now upper bounded as follows:

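$$B = \sum_{s \in \mathcal{S}^c} \sum_{\tilde{s} \in T_s} \left[ \sqrt{N(\tilde{s}) + 1} + \frac{1}{2} \right]$$
$$\le \sum_{s \in \mathcal{S}^c} |T_s| \sqrt{\frac{1}{|T_s|} \sum_{\tilde{s} \in T_s} N(\tilde{s}) + 1} + \frac{1}{2} \sum_{s \in \mathcal{S}^c} |T_s|$$
$$\le \sqrt{\sum_{s \in \mathcal{S}^c} |T_s| \cdot \left( \sum_{s \in \mathcal{S}^c} \sum_{\tilde{s} \in T_s} N(\tilde{s}) + \sum_{s \in \mathcal{S}^c} |T_s| \right)} + \frac{1}{2} \sum_{s \in \mathcal{S}^c} |T_s|$$
$$\le \sqrt{\sum_{s \in \mathcal{S}^c} |T_s| \cdot \left( N + \sum_{s \in \mathcal{S}^c} |T_s| \right)} + \frac{1}{2} \sum_{s \in \mathcal{S}^c} |T_s|, \qquad (A.1)$$

where the first inequality follows from the concavity of the square-root function, and the
second to the last inequality follows from the Cauchy-Schwarz inequality. Now, $\sum_{s \in \mathcal{S}^c} |T_s|$,
which is upper bounded by the total number of contexts generated by the algorithm, $|\tilde{\mathcal{S}}_N|$, is
in turn upper bounded by the following consideration: Denoting by $\tilde{\mathcal{S}}_N^{\mathrm{int}}$ the set of internal
nodes of $\tilde{\mathcal{S}}_N$, we have for every positive integer $j$:

$$N \ge \sum_{s \in \tilde{\mathcal{S}}_N^{\mathrm{int}}} M(d_s) \ge \sum_{s \in \tilde{\mathcal{S}}_N^{\mathrm{int}}:\, d_s \ge j} M(j) \ge \left( |\tilde{\mathcal{S}}_N^{\mathrm{int}}| - 2^j + 1 \right) \cdot M(j), \qquad (A.2)$$

where we have used the fact that the number of nodes with depth less than $j$ cannot exceed
$\sum_{i=0}^{j-1} 2^i = 2^j - 1$. We therefore have

$$|\tilde{\mathcal{S}}_N^{\mathrm{int}}| \le 2^j - 1 + \frac{N}{M(j)}, \qquad (A.3)$$

and so,

$$\sum_{s \in \mathcal{S}^c} |T_s| \le |\tilde{\mathcal{S}}_N| \le 2^{j+1} + \frac{2N}{M(j)}, \qquad (A.4)$$

which follows from the fact that in a complete binary tree with $m$ internal nodes, the total
number of nodes is $2m + 1$. Since this is true for every $j$, we can take the minimum over $j$.
Let us then denote

$$\psi(N) = \frac{1}{N} \min_j \left( 2^{j+1} + \frac{2N}{M(j)} \right) = 2 \min_j \left( \frac{2^j}{N} + \frac{1}{M(j)} \right). \qquad (A.5)$$

We therefore obtain the following upper bound to the redundancy:

$$\frac{1}{N} \sum_{t=1}^{N} \Pr\{\hat{x}_t \ne x_t\} - \kappa(x^N, S_N) \le \frac{[2M(S_N) + 1] S_N}{N} + \sqrt{\psi(N)[1 + \psi(N)]} + \frac{\psi(N)}{2}. \qquad (A.6)$$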

The guidelines regarding the choice of the sequence $\{M(k)\}$ are, in principle, aimed at
minimizing the r.h.s. of the last inequality. Obviously, the faster the growth rate of
$\{M(k)\}$, the faster $\psi(N)$ decays, but on the other hand, the first term above is enlarged.

Moreover, this dictates an interesting tradeoff with regard to universal achievability. If
one wishes to compete with the context predictability for every sublinear growth rate of
$\{S_N\}$, then $M(k)$ should be a constant $M_0$ (otherwise $S_N M(S_N)/N$ may not tend to zero),
but then $\psi(N)$ tends to a constant, which can be made arbitrarily small for large enough
$M_0$. Thus, the context predictability is achieved within an arbitrarily small $\epsilon > 0$, but
not strictly achieved. If, on the other hand, one is somewhat less ambitious, and is only
interested in achieving the context predictability for slower sequences $\{S_N\}$, i.e., those for
which $\{S_N M(S_N)/N\}$ still vanishes for a certain choice of the sequence $\{M(k)\}$, then this
is accomplished by the algorithm. For example, if $M(k) = 2^k$, then $\psi(N) = O(1/\sqrt{N})$, but
then $\{S_N\}$ of the reference class is only allowed to grow slower than logarithmically in $N$,
for the purpose of comparison.

Finally, it is interesting to compare the performance of the proposed horizon-independent
algorithm to that of the LZ-based algorithm of [1]. To this end, let us even assume that
$S_N^c = S = 2^k$ is fixed (not growing with $N$), and that our reference predictor is a pure
context-tree algorithm (with no transient states), where the context tree is the full binary
tree whose leaves are all the $2^k$ binary $k$-tuples, in other words, a finite-memory ("Markov")
predictor of order $k$. In [1, Theorem 4], it is asserted that the (upper bound on the)
redundancy of the LZ-based predictor w.r.t. this finite-memory predictor decays at the rate
of $1/\sqrt{\log N}$. Here, on the other hand, if we choose, for example, $M(k) = 2^k$, as suggested
above, then the redundancy would decay at the rate of $N^{-1/4}$, which is better. Moreover,
the choice $M(k) = 2^k$ may not even be the best possible choice. One can come close to the
rate of $N^{-1/2}$ by letting $\{M(k)\}$ grow sufficiently rapidly.

References 

[1] M. Feder, N. Merhav, and M. Gutman, "Universal prediction of individual sequences," 
IEEE Trans. Inform. Theory, vol. 38, no. 4, pp. 1258-1270, July 1992. 

[2] P. Jacquet, W. Szpankowski, and I. Apostol, "A universal predictor based on pattern 
matching," IEEE Trans. Inform. Theory, vol. 48, no. 6, pp. 1462-1471, June 2002. 

[3] A. Martin, G. Seroussi, and M. J. Weinberger, "Linear time universal coding and time 
reversal of tree sources via FSM closure," IEEE Trans. Inform. Theory, vol. 50, no. 7, 
pp. 1442-1468, July 2004. 






[4] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform. Theory, vol.
44, no. 6, pp. 2124-2147, October 1998. Also, in Information Theory: 50 Years of
Discovery, pp. 80-103, Eds. S. Verdú and S. McLaughlin, IEEE Press, 1999.

[5] Y. M. Shtarkov, T. J. Tjalkens, and F. M. J. Willems, "Multialphabet weighting
universal coding of context tree sources," Problems of Information Transmission (IPPI),
vol. 33, no. 1, pp. 17-28, 1997.

[6] M. J. Weinberger and G. Seroussi, "Sequential prediction and ranking in universal 
context modeling and data compression," HPL Technical Report no. HPL-94-111, 
November 1994. 

[7] M. J. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: A low complexity,
context-based, lossless image compression algorithm," Proc. DCC '96, Snowbird, Utah, March
1996.

[8] F. M. J. Willems, "The context-tree weighting method: extensions," IEEE Trans.
Inform. Theory, vol. 44, no. 2, pp. 792-798, March 1998.

[9] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting
method: basic properties," IEEE Trans. Inform. Theory, vol. 41, no. 3,
pp. 653-664, May 1995.

[10] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," 
IEEE Trans. Inform. Theory, vol. IT-24, no. 5, pp. 530-536, September 1978. 

[11] J. Ziv, "An efficient universal prediction algorithm for unknown sources with limited 
training data," IEEE Trans. Inform. Theory, vol. 48, no. 6, pp. 1690-1693, June 2002. 

[12] J. Ziv, "Correction to: 'An efficient universal prediction algorithm for unknown 
sources with limited training data' [1]," IEEE Trans. Inform. Theory, vol. 50, no. 8, 
pp. 1851-1852, August 2004. 






